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An algorithm has been devised to compute the inner and outer product between two arbitrary 
multi-dimensional arrays A and B in a single piece of code. It was derived using A Mathematics of 
Arrays (MoA) and the -^-calculus. Extensive tests of the new algorithm are presented for running 
in sequential as well as OpenMP multiple processor modes. 



INTRODUCTION 

In this work we consider the efficient computation of 
inner and outer products of arbitrary multi-dimensional 
arrays (tensors). Our algorithm was presented in a pre- 
vious work and was derived and expressed using the for- 
malism known as A Mathematics of Arrays (MoA) 
The routine maximizes data locality and computes both 
operations (inner and outer product) in a single piece of 
code. In this work we emphasize computational exper- 
iments and refer the reader to Ref [1] for details of the 
formalism and the derivation. 

We now give a brief schematic discussion of the algo- 
rithm. Using traditional notation (as opposed to MoA), 
an outer product of two multi-dimensional arrays (ten- 
sors) A and B, is given in terms of components of the 
result: 
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The MoA outer product is more general than given above 
in that the binary operator * (times) is generalized to be 
any binary operation (e.g. +, — , *, /, etc.). 

The MoA inner product is equivalent to a tensor con- 
traction. Working with the above arrays, we would write: 
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where, as in the case of the MoA outer product, the bi- 
nary operation * (times) can be any binary operation. 
From Eq.[2]we can conclude two things: (1) the standard 
matrix multiply between two matrices A and 5 is a spe- 
cial case of the MoA inner product, and (2) the MoA in- 
ner product is intimately related to the MoA outer prod- 
uct of Eq. [TJ It is therefore natural that both operations 
should be embodied in the same piece of code. 

Any arbitrary pair of arrays can be handled because of 
the generality of the formalism and implementation. The 
concept of array shape plays a key role. The shape of 
an array is given by a vector whose components give the 
lengths of the corresponding dimensions. Thus an array 
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FIG. 1: Illustration of the general algorithm for the matrix 
multiply (inner product) operation. Data locality is maxi- 
mized in that an entire row of the result, C, is computed at 
once while the elements of the left array A and right array B 
are stored and accessed contiguously. In this figure we illus- 
trate how the first row (indicated by < > ipC is computed). 



input to the routine is described by the shape vector and 
a vector containing the elements of the array. These con- 
cepts are illustrated in Fig. [TJ In this example, we take 
the arrays A and B to be two dimensional (i.e. matrices). 
The array A has shape pA =< 2 3 > and B has shape 
pB =<3 4>. In traditional language we would say that 
A is a 2 X 3 matrix and 5 is a 3 x 4 matrix. 

Data locality is maximized in that each row of the re- 
sult (indicated by < > V^C) is computed at a time and 
the elements of A and B are accessed contiguously. In 
this figure we have illustrated some of the notational de- 
vices of MoA. In this formalism, the tjj operator selects 
components and subarrays of a given array with the use 
of an index vector. In this case we select the zero^th row 
of C with the operation < > ipC. We see that each row 
of B is multiplied by an element of A and then added to 
the next row of B multiplied by an element A. This for- 
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mulation of the inner product might seem simple but is 
actuahy quite subtle in the general case of two arbitrary 
multi-dimensional arrays. This way of organizing the op- 
erations leads to significant performance gains as will be 
demonstrated in the numerical tests to be described be- 
low. 

In the rest of this paper we present performance tests 
of our routine for the computation of the standard ma- 
trix multiply in comparison with a benchmark routine 
(dgemm.f) taken from the BLAS library. We present 
tests of both sequential and OpenMP parallel implemen- 
tations. We find that our routine is either competative 
or outperforms the BLAS routine. 

As discussed more extensively below, our goal is to 
demonstrate the advantages of our design methodology 
using MoA. We don't claim to have established the the 
best matrix multiply and in no way do we wish to en- 
ter such a competition. Indeed the matrix multiply has 
been extensivelystudied 0, S, S, S, @, 0, @, B llil, 
H, [H, Q [H, lii, [13 and we defer to the experts for 
those searching for the best matrix multiply. For our 
OpenMP version we also make no attempt at optimiza- 
tion. We simply adopt a "poor man's" parallelism by 
wrapping sequential code with the simplest OpenMP 
statements (not even specifying a "chunk" size, for exam- 
ple). The point is that we take an "off the shelf sequen- 
tial benchmark, the standard BLAS routine dgemm.f and 
compare it with our generalized inner and outer prod- 
uct code for the limited case of matrix multiply and we 
find competative results without any optimizations other 
than the fortran compiler options — OO, —01, —02 and 
—03. For high performance parallel matrix multiply we 
again defei^to the experts cited above as well as those in 
Refs. [lal, y, y, l2l|, I22I, l23|, l25|, y, l27| . 



NUMERICAL EXPERIMENTS 
Computational Environment 

A series of sequential and multi-processor tests were 
carried out for the MoA routine in comparison with the 
standard BLAS dgemm.f. The key code fragments are 
presented in the Appendix. A dedicated, non-shared, 
computational environment was used on the 5,120 pro- 
cessor machine "jaws" at the Maui High-Performance 
Computing Center. The following information is quoted 
from the website (www.mhpcc.hpc.mil): 

"Jaws is a Dell PowerEdge 1955 blade server cluster 
comprised of 5,120 processors in 1,280 nodes. Each node 
contains 2 Dual Core 3.0 GHz 64-bit Woodcrest CPUs, 
8GB of RAM, and 72GB of local SAS disk. Addition- 
ally, there is 200TB of shared disk available through the 
Lustre filesystem. The nodes are connected via Cisco 
Infiniband, running at lOGbits/sec (peak). Jaws has a 



peak performance of 62400 GFlops, and LINPACK per- 
formance of 42390 GFlops." 

In the following we will present results for our new 
routine run in sequential and OpenMP multi-threaded 
tests. 



Tests for matrix-matrix multiply 

Our object of study is the computation of the matrix 
multiply C = B where we consider C to be a m x n 
matrix, where m is the number of processors (threads) 
and n is an integer power of 2 and is varied from the 
smallest to largest sizes that can be accomodated. The 
matrix A has dimensions m x i while B has dimensions 
£xn. For these tests we keep £ fixed at the value £ = 128. 

There are two performance metrics of interest in this 
study: (1) the "time per thread" and (2) the "total time". 
For a multi-thredded job the "time per thread" is simply 
the total time for the job to run. Note that a job with 
m threads is dealing with a problem size that is m times 
as large as the problem considered on 1 thread. Thus 
if there were no communcation costs we would expect 
the curve of "time per thread" vs. n to be the same, 
independent of the number of threads m. 

In some cases we wish to consider a fixed problem size 
and see how long it takes on m = 1, 2, 3, and 4 processors. 
In this case we take the curves discussed in the previous 
paragraph and scale the x axis (i.e. problem "size") of 
each curve by multiplying by the corresponding number 
of processors m. This type of plot should explicitly show 
the benefit of parallelism (if there is one) if the curve for 
m threads is below that for 1 thread. 



Sequential tests 

In a first series of numerical experiments we tested the 
MoA routine and the BLAS dgemm.f routine in sequen- 
tial mode in a dedicated non-shared batch environment. 
Perl scripts were used in each case to compile the routine 
for a given value of n and then the job is timed. This 
process is repeated three times for each n and the timings 
were averaged. As reproducibility is a key concern, we 
also repeated several tests on different days of the week to 
make sure there were no substantial fiuctuations. In all 
cases tested we found essentially identical results. From 
these careful considerations we conclude that all results 
presented in this work are reproducible. 

Our initial interest was in determining the effect of 
compiler options on the performance of our routine and 
the BLAS routine. We thus ran our experiments with the 
four optimization fiags: — OO (no optimization), — Ol, 
—02 and —03 in four separate tests respectively. We 
used the Intel Fortran compiler "ifort" that was supplied 
with the machine. In all cases we found the compiler 
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option —01 to give the best performance. Thus in ah 
results to be presented, we assume the compiler option 
—01 to be in effect. Interestingly, for the Mo A routine, 
we find a benefit on going from — OO to —01 but then no 
difference between —01, —02, and —03. In constrast, 
however, for the BLAS routine we find the speed to in- 
crease upon going from — OO to —02 and then to increase 
again upon going from —02 to —01 while the results for 
— 02 and —03 were essentially identical. 

Comparison of the MoA routine vs. the BLAS routine 
dgemm.f run in sequential mode are presented in Fig. [2l 
We see that the results are comparable with a slight ben- 
efit given by the MoA routine for small sizes. For the 
largest sizes that fit in real memory, the BLAS routine 
outperformes the MoA routine but for sizes requiring vir- 
tual memory the results are essentially the same. We will 
see that the superior performance of the BLAS routine at 
largest sizes is lost when we go to multiple threads using 
OpenMP. 
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FIG. 2: Best sequential MoA result compared with the best 
sequential BLAS result (flag -01). 



Multiple-thread OpenMP 

Our next set of computational experiments were per- 
formed using OpenMP multiple threads. On this ma- 
chine (see description in first section) each node contains 
four processors (two dual cores) and so we restrict our 
attention in this series of experiments to m = 1, 2, 3, 
and 4 threads. Making the transition from a sequential 
piece of code to open OpenMP is achieved by wraping the 
seqential algorithm, in each case with simple OpenMP di- 
rectives. No attempt was made to optimize the parallel 
performance of either routine. 

Our goal in the multi-threaded tests to be discussed 
herein is as follows. We are proceeding from a gen- 
eral, mathematically-based design methodology. Thus 
although our code was not designed to specifically ex- 
ploit the mult i- threading capabilities of this machine, we 
achieve impressive results in comparison with the BLAS 
benchmark. Again, we emphasize the design methodol- 
ogy. Our approach is completely mechanizable from the 
ONF (Operational Normal Form). That is given start, 
stop, stride, count, we can instantiate the software at 
any level of memory [H, [H, [s^, [HI, |32[. We are not 
trying to claim that we have achieved the fastest multi- 
threaded matrix multiply. Nor are we comparing our 
results against a BLAS routine that has been designed 
for multi-processor, multi-threaded hardware. That is 
not our goal, but rather to argue the merits of a de- 
sign methodology that consistently leads to efficient im- 
plementations by eliminating temporaries and exploiting 
data locality. 

Figure [3] presents results for m = 1, 2, 3, and 4 
OpenMP threads for the MoA routine. We plot the 
time/thread for each job. In other words this is the 
total time for the job to run with the size of the problem 
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FIG. 3: Comparison of the time/thread for threads m — 1, 
2, 3 and 4 assuming the fastest compiler option in each case 
(i.e. -01) for the MoA routine. Note, the problem size is 
proportional to the number of threads m. Thus the differences 
between the four curves represent communication costs. 

proportional to the number of threads m. This met- 
ric illustrates the communication cost associated with 
multiple threads because, in the absence of communi- 
cation cost (i.e. in a situation of "perfect parallism") 
the time/thread vs. n (i.e. the number of columns of 
C = AB) would be independent of the number of threads 
m. 

In Fig. [4] we emphasize the net benefft of the use of 
multiple threads by considering the "total time" vs. the 
size of the problem. In other words, in this case, for each 
value of m (i.e. the number of threads) we scale the x- 
axis of the "time/thread" plot illustrated in Fig. [3] by 
m. Thus for a given value of n, if the curve for a given 
number of threads lies below that for m = 1, there is a 
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FIG. 4: Comparison of the total time for threads m — 1, 
2, 3 and 4 assuming the fastest compiler option in each case 
(i.e. -01) for the MoA routine. These curves were obtained 
from the ones of Fig. [3] by rescahng the x-axis of each curve 
by multiplying by the corresponding number of processors m. 
Thus, in this case the x axis (n) represents the total problem 
size. Note for this range of n, there is benefit in going from 
m = 1, to 2 but there is no net benefit in using m = 3 and 4. 
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FIG. 5: Comparison of the time/thread for threads m = 1, 
2, 3 and 4 assuming the fastest compiler option in each case 
(i.e. -01) for the BLAS routine dgemm.f. Note, the problem 
size is proportional to the number of threads m. Thus the 
differences between the four curves represent communication 
costs. 



net benefit to using multiple threads. We see that in this 
series of tests, there IS a net benefit to the use of m = 2 
threads but there is no net benefit for m > 2 threads. 

In Fig. [5] we plot the "time/thread" for the BLAS rou- 
tine with m = 1, 2, 3 and 4 threads. The curves look 
similar to those for the MoA routine of Fig. [H but as 
we will see the following figure there is a fundamental 
difference. 

The results for the "total time" vs. n for the BLAS 
routine are presented in Fig. [6l The results of Fig. [6l 



FIG. 6: Comparison of the total time for threads m — 1, 
2, 3 and 4 assuming the fastest compiler option in each case 
(i.e. -01) for the BLAS routine dgemm.f. These curves were 
obtained from the ones of Fig. [5] by rescaling the x axis of 
each curve by multiplying by the corresponding number of 
processors m. Thus, in this case the x axis (n) represents the 
total problem size. Note for this range of n, there is benefit 
in going from m = 1, to 2 but there is no net benefit in using 
m = 3 and 4. 



for the BLAS routine are fundamentally different from 
those for the MoA routine of Fig. [H in that, while all the 
m > 1 curves of Fig. [4] lie below the curve for m = 1, 
in Fig. [6] we see all m > 1 curves lie above the m = 1 
result. Thus for this series of experiments there is 
no net benefit to the use of multiple threads for 
the BLAS routine. 

In the next four figures we compare the "time/thread" 
for the MoA routine, directly with the BLAS routine. 
In Fig. [71 we compare the one-thread result for the MoA 
routine with the BLAS routine. We find that the MoA 
result out performs the BLAS routine for small matrix 
sizes and is equivalent to that of the BLAS routine for 
the largest sizes. Note that this figure should be directly 
compared with Fig. [2] for the sequential runs. We see 
that, although BLAS had the advantage for the largest 
sizes when running in sequential mode, the advantage is 
lost when going to m = 1 thread using OpenMP. 

For Figs. [HI through [10] we see that the MoA routine 
consistently out performs the BLAS routine for all sizes 
that fit into main memory. The BLAS routine out per- 
forms the MoA routine for sizes that only fit in virtual 
memory. The success of the MoA routine is due to the 
data locality of the algorithm's contiguous access of all 
arguments: an output with two inputs. 

Another way to characterize the data we have 
considered so far is as follows. We consider the 
"time/processor" for each number of threads m divided 
by the result for a single thread. Such results are pre- 
sented in Figs. [TTl and [T2l for the BLAS routine and the 
MoA routine respectively. As argued previously, such a 
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FIG. 7: Comparison of the MoA routine with the BLAS rou- 
tine dgemm.f for one thread. The MoA routine is superior for 
smaher sizes while the two are equivalent for large sizes. 



FIG. 10: Comparison of the MoA routine with the BLAS rou- 
tine dgemm.f for four threads. The MoA routine is superior 
for all sizes that fit in real memory. 
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FIG. 8: Comparison of the MoA routine with the BLAS rou- 
tine dgemm.f for two threads. The MoA routine is superior 
for all sizes that fit in real memory. 



O 

in 



1000 




0.01 



100000 



le+06 

Size n 



le+07 



FIG. 9: Comparison of the MoA routine with the BLAS rou- 
tine dgemm.f for three threads. The MoA routine is superior 
for all sizes that fit in real memory. 



ratio should illustrate the effects of communication costs. 
If there were no communication cost, each ratio would be 
unity. Next, we consider the notion that if the ratio is 
greater than m, there is no net benefit to using multiple 
threads as this would indicate that the job was more ex- 
pensive than m sequential jobs. For these (unoptimized) 
tests we conclude from Figs. [11] and [12] that there is no 
net benefit to the use of OpenMP multiple threads for 
the BLAS routine while there IS a net benefit to such use 
for the MoA routine. 

Again, we emphasize that this result is not definitive 
for establishing a matrix multiply that is superior to 
BLAS. Indeed there are BLAS routines (and others) that 
are optimized for multiple threads and multiple proces- 
sors [11, 18, 19, 20, 23, 24, 25, 26, 27, 33, 34, 35, 36]. We 
only emphasize the quality of our results as an advertise- 
ment for our methodical software design approach that 
exploits data locality as a fundamental principle. 



CONCLUSION 

We have presented numerical tests of a generalized in- 
ner and outer product routine applicable to arbitrary 
mult i- dimensional tensors specified at run time. Our al- 
gorithm computes either operation in a single piece of 
code. In this work we have focused on the limited case of 
matrix-matrix multiplication and have tested its perfor- 
mance for matrices from small sizes to the largest that 
can be possibly accomodated. As a benchmark reference 
we compare our results with the standard BLAS dgemm.f 
routine. We find that our routine is competative or out 
performs the BLAS routine. We have also presented tests 
of our routine using OpenMP parallelization without any 
machine specific optimizations (other than the compiler 
options — OO, —01, —02 and —03). Again we find com- 
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FIG. 11: Ratio of the time/ thread, for a given number of 
threads, m, to that for one thread for the BLAS routine 
dgemm.f. This metric allows one to judge the benefit of par- 
allelism. If there were no communication costs, such a ratio 
would be unity indicating "perfect parallelism". If this ra- 
tio is greater than m (as in this figure) then the overhead is 
more expensive than running m jobs sequentially ( "perfect se- 
quentialism" (?)). These results for this (unoptimized) routine 
indicate no benefit to the use of multiple threads. 
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FIG. 12: Ratio of the time/thread, for a given number of 
threads, m, to that for one thread for the MoA routine. This 
metric allows one to judge the benefit of parallelism. If there 
were no communication costs, such a ratio would be unity in- 
dicating "perfect parallelism" . If this ratio is greater than m 
then the overhead is more expensive than running m jobs se- 
quentially ("perfect sequentialism" (?)). For this (MoA) rou- 
tine the ratio is intermediate between unity and m, in each 
case, indicating a net benefit to the use multiple threads. 



c$OMP do private (i, j ,1) 

DO 90 J = 1,N 

IF (BETA. EQ. ZERO) THEN 
DO 50 I = 1,M 

C(I,J) = ZERO 
50 CONTINUE 

ELSE IF (BETA. NE. ONE) THEN 
c do 60 vectorized 

DO 60 I = 1,M 

C(I,J) = BETA*C(I,J) 
60 CONTINUE 
END IF 

DO 80 L = 1,K 

IF (B(L, J) .NE.ZERO) THEN 
TEMP = ALPHA*B(L,J) 

c do 70 vectorized 

DO 70 I = 1,M 

C(I,J) = C(I,J) + TEMP*A(I,L) 
70 CONTINUE 

END IF 
80 CONTINUE 
90 CONTINUE 
c$OMP end do nowait 



FIG. 13: Key code fragment used in the benchmark tests 
of the BLAS routine. The first and last lines are the only 
OPenMP directives used in these tests and are the same as 
those used in the MoA tests. 



c$OMP do private(k,i,l, j) 

do 100 k=0, (nthreads-1) 

do 120 i=k, (noproc-1) ,nthreads 
do 140 1=0, (rowsinred-1) 
do 160 j=0, (elsinop-1) 

RESADDRd + (i*restride) + j) = 

RESADDRd + (i*restride) + j) + 
LADDRd + 1 + (i*lstride))* 
RADDRd + (l*rstride) + j); 





+ 




+ 
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160 




140 




120 




100 





continue 
continue 
continue 
continue 
c$OMP end do nowait 



FIG. 14: Key code fragment used in the tests of the MoA 
routine. The first and last lines are the only OPenMP direc- 
tives used in these tests and are the same as those used in the 
BLAS tests. 



petative performance. As stated earlier, our goal is not to 
claim the best matrix multiply routine but rather to give 
definative tests of our routine for a well studied exam- 
ple: matrix multiply. Rather, we wish to emphasize the 
generality of our routine that can compute the inner and 
outer product between two arbitrary multi-dimensional 
arrays, as specified at run time, in a single piece of code. 



CODE FRAGMENTS FOR THE NUMERICAL 
EXPERIMENTS 



This appendix presents the key code fragments used in 
the testing of the BLAS routine (Fig. [13|) and the MoA 
routine (Fig. [Hj) 
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